This project contains an automated betting odds scraper for NBA matches. It scrapes odds from multiple bookies and returns the best price. The code also monitors for arbitrage opportunities, changepoints in the odds, and differences between the odds and modelled win probabilities.
The chosen model is the well-known Bradley-Terry model, trained on last season's match results, scraped from ESPN. The model updates each day to include the new information in the previous day's results.
The main components of the project are:
1. Scraping the odds from multiple bookies at regular
intervals. When matches are live, the scrape frequency increases.
2. Monitoring the best odds at each time period, and detecting
any arbitrages, sudden changes in the odds (for live matches), or
discrepancies with modelled probabilities.
3. Modelling and predicting the outcome of each match, which
updates automatically to include the results of the previous day. This
provides a reference point to compare odds to.
4. Notifying me of anything that comes up in monitoring, via an
automated email system.
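As a rough illustration of the "best odds" step in component 2, the sketch below picks the highest decimal odds per outcome across bookies. The dictionary layout, the bookie names, and the function name `best_price` are assumptions made for this example, not the project's actual data structures.

```python
from typing import Dict, Tuple


def best_price(quotes: Dict[str, Dict[str, float]]) -> Dict[str, Tuple[str, float]]:
    """For each outcome, return the (bookie, decimal odds) pair with the highest odds."""
    best: Dict[str, Tuple[str, float]] = {}
    for bookie, prices in quotes.items():
        for outcome, odds in prices.items():
            # Keep the quote only if it beats the best one seen so far.
            if outcome not in best or odds > best[outcome][1]:
                best[outcome] = (bookie, odds)
    return best


# Hypothetical quotes for one match, keyed by bookie and then by outcome.
quotes = {
    "bookie_a": {"home": 1.80, "away": 2.05},
    "bookie_b": {"home": 1.85, "away": 2.00},
}
best = best_price(quotes)
```

Here the best home price comes from `bookie_b` and the best away price from `bookie_a`, which is the input the arbitrage check needs.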
The project is hosted on a DigitalOcean droplet running Ubuntu 22.04 (Linux). This allows for cron scheduling of the automated components, using Process Manager 2 (PM2).
To model each match, I chose a Bradley-Terry model. It is trained on head-to-head data, which I scraped from last season, and predicts outcomes by assigning each team $i$ a positive ‘strength’ $\beta_i$, so that

$$\Pr\{i \text{ beats } j\} = \frac{\beta_i}{\beta_i + \beta_j}.$$

It also allows for other effects, for example home advantage. If Team A plays at home against Team B, a home-advantage term $\alpha$ is added on the log-odds scale:

$$\log\frac{\Pr\{A \text{ beats } B\}}{\Pr\{B \text{ beats } A\}} = \alpha + \log\beta_A - \log\beta_B.$$
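A minimal sketch of the win probability calculation, assuming the standard Bradley-Terry form in which the log-odds of a home win are alpha + log(beta_A) - log(beta_B). The function name and the example strengths are illustrative assumptions, not fitted values from this project.

```python
import math


def bt_win_probability(beta_home: float, beta_away: float, alpha: float = 0.0) -> float:
    """Probability that the home team wins under a Bradley-Terry model.

    beta_home, beta_away: positive team strengths.
    alpha: home-advantage term on the log-odds scale (0 means no home advantage).
    """
    log_odds = alpha + math.log(beta_home) - math.log(beta_away)
    # Inverse logit turns the log-odds back into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative strengths only.
p_neutral = bt_win_probability(2.0, 1.0)           # equals 2 / (2 + 1)
p_home = bt_win_probability(2.0, 1.0, alpha=0.3)   # home advantage raises this
```

With alpha set to 0 this reduces to beta_A / (beta_A + beta_B), so two equally strong teams each win with probability 0.5.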
I chose this model as it is well known for sports events, and I want to use it as a baseline for analysing the efficiency of the betting markets. In future I would like to extend this to the Bradley-Terry-Luce and Thurstone-Mosteller models, as well as BTDecay. I will also compare these against win frequency as a control.
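Below I have a plot of Bradley-Terry ‘strengths’ against win frequency. The two are closely correlated (roughly 0.95). The remaining differences plausibly arise because Bradley-Terry accounts for the strength of the opponents a team has beaten or lost to, whereas win frequency does not, although I have not yet verified this explanation.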
The data:
Home advantage is a well-known phenomenon in all sports, but before including it in our Bradley-Terry model it is still important to verify statistically that it exists. To do this, I performed a chi-squared likelihood ratio test comparing a model with home advantage to one without. The output is contained below, and we can clearly reject the null hypothesis that home advantage does not exist, as the p-value is significant at the 1% level.
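For reference, a likelihood ratio test of this kind can be computed by hand from the two maximised log-likelihoods. The sketch below assumes the home-advantage model adds exactly one parameter, so the statistic is compared against a chi-squared distribution with 1 degree of freedom. The log-likelihood values are made up for illustration and are not the project's results.

```python
import math
from typing import Tuple


def lr_test_one_df(loglik_null: float, loglik_alt: float) -> Tuple[float, float]:
    """Likelihood ratio test for one extra parameter (here, home advantage).

    Returns (statistic, p_value). For 1 degree of freedom the chi-squared
    survival function equals erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (loglik_alt - loglik_null)
    # Guard against tiny negative statistics caused by numerical noise.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value


# Illustrative log-likelihoods only (not fitted values).
statistic, p_value = lr_test_one_df(loglik_null=-800.0, loglik_alt=-790.0)
reject_at_1_percent = p_value < 0.01
```

A statistic of about 3.84 corresponds to a p-value of 0.05, and about 6.63 to 0.01, which gives a quick sanity check on the function.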
Efficient betting markets should reflect all available information, so sports odds should imply an accurate probability for a given outcome, with the odds themselves set slightly lower than fair value to cover the bookie's margin. However, if bookies base their odds on betting flow to give themselves a neutral position, then there may exist situations where the odds deviate from the true probabilities.
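To make the link between odds and probabilities concrete, the sketch below converts decimal odds into implied probabilities and measures the bookie's margin (the overround). It assumes odds are quoted in decimal format. The function names and the proportional normalisation are illustrative choices, not necessarily what the project uses.

```python
from typing import List


def implied_probabilities(decimal_odds: List[float]) -> List[float]:
    """Raw implied probability of each outcome: 1 / decimal odds."""
    return [1.0 / odds for odds in decimal_odds]


def overround(decimal_odds: List[float]) -> float:
    """Bookie margin: how far the raw implied probabilities sum above 1."""
    return sum(implied_probabilities(decimal_odds)) - 1.0


def normalised_probabilities(decimal_odds: List[float]) -> List[float]:
    """Rescale the raw implied probabilities so they sum to 1."""
    raw = implied_probabilities(decimal_odds)
    total = sum(raw)
    return [p / total for p in raw]


# Hypothetical two-way market priced at 1.90 on each side.
odds = [1.90, 1.90]
margin = overround(odds)               # positive: raw probabilities sum above 1
fair = normalised_probabilities(odds)  # 0.5 each once the margin is removed
```

A positive margin on a single bookie's book is the normal case. It is only when the best prices are combined across bookies that the sum can fall below 1.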
Occasionally, bookies disagree on the odds offered, and a well-equipped bettor can place an arbitrage bet to profit from this situation, irrespective of the outcome. This happens when the implied probabilities of the best available odds for each outcome sum to less than 1 (for two-outcome events with no draws, such as basketball).
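The arbitrage condition can be sketched as follows, assuming decimal odds and taking the best available price for each outcome. The function name, the default stake, and the example odds are hypothetical.

```python
from typing import List, Optional


def arbitrage_stakes(best_odds: List[float], total_stake: float = 100.0) -> Optional[List[float]]:
    """Return stakes per outcome that give the same payout whichever outcome occurs.

    Returns None when the implied probabilities sum to 1 or more (no arbitrage).
    """
    inverse_sum = sum(1.0 / odds for odds in best_odds)
    if inverse_sum >= 1.0:
        return None
    # Staking in proportion to 1 / odds equalises the payout across outcomes.
    return [total_stake / (odds * inverse_sum) for odds in best_odds]


# Hypothetical best prices across bookies for a two-outcome match.
stakes = arbitrage_stakes([2.10, 2.05])
no_arb = arbitrage_stakes([1.85, 2.05])  # implied probabilities sum above 1
```

When an arbitrage exists, the guaranteed payout is `total_stake / inverse_sum`, so the profit grows as the sum of implied probabilities falls further below 1.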
To investigate these phenomena, I needed to build a process that automatically scraped the odds from multiple bookies at regular intervals. I also needed to scrape the most recent match results automatically and update my model to reflect this new information. This process should then check for arbitrage opportunities and discrepancies between the modelled and implied probabilities.
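The discrepancy check between modelled and implied probabilities could look something like the sketch below. The 0.05 threshold and the function name are arbitrary assumptions for illustration, and the implied probability here is the raw 1 / odds value with no adjustment for the bookie's margin.

```python
from typing import Optional


def flag_discrepancy(model_prob: float, decimal_odds: float, threshold: float = 0.05) -> Optional[float]:
    """Return model_prob minus implied probability if the gap is large, else None.

    threshold is a hypothetical cut-off; a positive return value means the
    model rates the outcome as more likely than the odds imply.
    """
    implied = 1.0 / decimal_odds
    gap = model_prob - implied
    return gap if abs(gap) >= threshold else None


# Hypothetical inputs: the model says 60%, the odds of 2.00 imply 50%.
flagged = flag_discrepancy(0.60, 2.00)
ignored = flag_discrepancy(0.52, 2.00)  # gap of 0.02 is below the threshold
```

In the full process, a non-None result would be the kind of event passed on to the email notification step.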